
(CVPR 2018) Visual Question Reasoning on General Dependency Tree

Cao Q, Liang X, Li B, et al. Visual question reasoning on general dependency tree[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7249-7257.



1. Overview


1.1. Motivation

  • recent work tried to model explicit compositional processes that assemble the multiple sub-tasks embedded in a question, but relies on hand-crafted rules or layout annotations

This paper proposes ACMN (Adversarial Composition Modular Network)

  • adversarial attention module. local visual evidence for each word
  • residual composition module.
  • construct a dependency tree for each question
  • clausal predicate relation
  • modifier relation
  • force each parent node to explore new regions by masking out the attended regions of its child nodes at each step


1.2. Contribution

  • interpretable reasoning VQA system
  • adversarial attention module. enforce efficient visual evidence mining for modifier relations
  • residual composition module. integrates knowledge of child nodes for clausal predicate relations

1.3. Related Work

1.3.1. VQA

  • CNN-LSTM
  • attention, stacked attention, co-attention
  • joint embedding
  • compact bilinear method

1.3.2. Reasoning Models

  • database queries



2. Adversarial Composition Modular Network




  • generate a dependency tree for each question with the Stanford universal dependency parser
  • prune leaf nodes that are not nouns
  • M. modifier relation
  • P. clausal predicate relation
  • x. node
  • x_c1, x_c2, …, x_cn. the n child nodes of x
  • v. spatial features extracted by a pre-trained CNN
  • w. word embedding produced by a Bi-LSTM
  • set of modules f. applied at each word; weights shared across nodes
  • input of f. (v, w, children’s outputs)
  • output of f. (attention map att_out, hidden feature h_out)
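The notation above can be sketched as a recursive tree traversal with a shared per-node module. This is a hypothetical minimal implementation: `Node`, `f`, and `run_tree` are illustrative names, and the module body is a stub, not the paper's learned module.

```python
import numpy as np

class Node:
    """One word/node of the dependency tree (illustrative structure)."""
    def __init__(self, word_emb, relation=None, children=None):
        self.w = word_emb          # word embedding from the Bi-LSTM
        self.relation = relation   # 'M' (modifier) or 'P' (clausal predicate)
        self.children = children or []

def f(v, w, child_outputs):
    # Stub for the shared module: uniform attention over regions and a
    # toy hidden feature; the real module is learned.
    att_out = np.full(v.shape[0], 1.0 / v.shape[0])
    h_out = v.mean(axis=0) + w
    return att_out, h_out

def run_tree(node, v):
    # Post-order traversal: children are evaluated first, then the parent
    # consumes their (att_out, h_out) outputs.
    child_outputs = [run_tree(c, v) for c in node.children]
    return f(v, node.w, child_outputs)
```

Weight sharing is captured by every node calling the same `f`; only the inputs differ per node.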

2.1. Adversarial Attention Module

  • select the child nodes with relation ∈ M
  • sum the attention maps of these child nodes
  • mask = 1 - the summed attention map
  • multiply the mask with v
  • output a new attention map att_out
  • compute h’ based on att_out
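The masking steps above can be sketched as follows. This is a hypothetical minimal version: the per-region relevance score is a stand-in for the paper's learned attention, and the function name is illustrative.

```python
import numpy as np

def adversarial_attention(v, child_atts):
    # v: spatial features, shape (R regions, D channels)
    # child_atts: attention maps of the modifier (M) children, each shape (R,)
    summed = sum(child_atts) if child_atts else np.zeros(v.shape[0])
    mask = 1.0 - np.clip(summed, 0.0, 1.0)   # suppress already-attended regions
    masked_v = v * mask[:, None]             # mask out the children's evidence
    scores = masked_v.sum(axis=1)            # stand-in relevance score per region
    e = np.exp(scores - scores.max())
    att_out = e / e.sum()                    # new attention map (softmax)
    h_prime = att_out @ v                    # attended feature h'
    return att_out, h_prime
```

The `1 - sum` mask is what makes the attention "adversarial": regions already covered by the children are down-weighted, so the parent is pushed toward new visual evidence.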

2.2. Residual Composition Module

  • sum the hidden features h of the child nodes with relation ∈ P
  • concatenate this summation with h’
  • add the fused result to the summed children’s h (residual connection)

2.3. Overview



  • the nodes with modifier relations M can modify their parent node by referring to a more specific object
  • the nodes with clausal predicate relations P can enhance the representation



  • the root output feature h_root → 3-layer MLP to predict the answer y
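A minimal sketch of this answer head, assuming ReLU between hidden layers and a softmax over the answer vocabulary (these activation choices are assumptions; the bullet only specifies a 3-layer MLP).

```python
import numpy as np

def answer_head(h_root, layers):
    # layers: three (W, b) pairs -- the 3-layer MLP over the root feature
    x = h_root
    for i, (W, b) in enumerate(layers):
        x = W @ x + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)   # ReLU between hidden layers (assumed)
    e = np.exp(x - x.max())          # softmax over the answer vocabulary
    return e / e.sum()
```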



3. Experiments




3.1. Dataset

  • CLEVR
  • Sort-of-CLEVR
  • VQAv2

3.2. Details

  • 224x224 image
  • max tree height 13 for CLEVR

3.3. Comparison




3.4. Ablation Study